ABSTRACT

We present Oqtans, an open-source workbench for quantitative
transcriptome analysis that is integrated in Galaxy. Its distinguishing
features include customizable computational workflows and a modular
pipeline architecture that facilitates comparative assessment of tool
and data quality. Oqtans integrates an assortment of machine
learning-powered tools into Galaxy that perform as well as or better
than state-of-the-art tools. Implemented tools comprise a complete
transcriptome analysis workflow: short-read alignment, transcript
identification/quantification and differential expression analysis.
Oqtans and Galaxy facilitate persistent storage, data exchange and
documentation of intermediate results and analysis workflows. We
illustrate how Oqtans aids the interpretation of data from different
experiments in easy-to-understand use cases. Users can easily create
their own workflows and extend Oqtans by integrating specific tools.
Oqtans is available as (i) a cloud machine image with a demo instance
at cloud.oqtans.org, (ii) a public Galaxy instance at
galaxy.cbio.mskcc.org and (iii) a git repository containing all
installed software (oqtans.org/git), most of which is also available
from (iv) the Galaxy Toolshed and (v) a share string to use along with
Galaxy CloudMan.
Contact: vipin@cbio.mskcc.org, ratschg@mskcc.org 
Supplementary information: Supplementary data are available at 
Bioinformatics online. 

1 INTRODUCTION

Technological advances in large-scale sequencing have revolutionized
molecular biology. Applied to profiling the transcriptome, the total
complement of cellular RNA, this technology is called RNA-seq; it
provides an unmatched dynamic range for expression quantification and
single-base-pair resolution for the discovery of new transcripts
(Mortazavi et al., 2008). However, analyzing these complex data to
their full potential requires computational frameworks.

Here, we present Oqtans, the online platform for quantitative
RNA-seq data analysis (online since 2010). Its integration into
the Galaxy framework ensures transparent and reproducible
computational analyses. Oqtans provides a Galaxy interface to
many recently developed RNA-seq analysis tools and thereby
considerably extends the standard repertoire of the Galaxy toolbox
(usegalaxy.org). To reach both non-expert users and experienced
developers, we provide the Oqtans tool suite in five incarnations:
(i) as a cloud machine image (see cloud.oqtans.org for a demo),
(ii) as a public Galaxy instance at galaxy.cbio.mskcc.org and (iii) as
a git repository (oqtans.org/git); most of these tools are also
available from (iv) the Galaxy Toolshed and (v) a preconfigured
share string for launching Galaxy CloudMan through its
instance-sharing functionality.

2 RESULTS 

Oqtans provides a versatile analysis workbench for RNA-seq data
comprising tools suitable for basic and advanced analysis tasks
(see Supplementary Table S1 for a current list of Oqtans tools and
Supplementary Table S2 for supported file formats). Their modular
organization within the Galaxy framework allows advanced
users to easily customize and extend analysis workflows.

We showcase Oqtans' capabilities in use cases for which all
data, parameters, intermediate output and final results are
made public on a Page in our Galaxy cloud instance (see
oqtans.org/usecases).

As a first use case, we wanted to identify annotated genes that
were differentially expressed between male and female Drosophila
melanogaster fruit flies [using data from Daines et al. (2011)]. This
analysis requires three major steps: read alignment, quantification
and enrichment analysis (Fig. 1A and B). The chosen Oqtans tools
were combined in a workflow (Supplementary Fig. S1).

Fig. 1. Schematic workflows of the Oqtans use cases. (A) The general
steps needed to perform the analysis. (B) Tools included in Oqtans used
for differential expression and GO term enrichment analysis (use case 1).
The same workflow in the Galaxy instance is shown in Supplementary
Figure S1.

After starting an Oqtans cloud instance in Amazon Web
Services EC2 (machine image ami-65376a0c) and importing
the RNA-seq read data from the NCBI Short Read Archive, we
aligned these to the reference genome. Oqtans currently offers
three tools for spliced alignment of short reads: TopHat
(Trapnell et al., 2009), STAR (Dobin et al., 2013) and
PALMapper (Jean et al., 2010). Subsequently, we determined
genes that were differentially expressed between males and females
using DESeq, which tests read counts for statistically significant
differences (Anders and Huber, 2010).
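
To make the count-based comparison more concrete, the following minimal Python sketch illustrates only the normalization step that precedes such a test: median-of-ratios size factors, as described by Anders and Huber (2010), followed by a log2 fold change between two groups of samples. It is not the DESeq implementation and does not perform the negative binomial test; the function names, the pseudocount of 1 and the toy counts are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a gene -> per-sample count table.

    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero. At least one gene must have
    non-zero counts in every sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) <= 0:
            continue
        # Log of the geometric mean of this gene across all samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def log2_fold_changes(counts, group_a, group_b):
    """log2(mean of group_b / mean of group_a) on normalized counts.

    group_a and group_b are lists of sample indices. A pseudocount of 1
    (an illustrative choice) avoids taking the logarithm of zero.
    """
    factors = size_factors(counts)
    result = {}
    for gene, gene_counts in counts.items():
        norm = [c / f for c, f in zip(gene_counts, factors)]
        mean_a = sum(norm[j] for j in group_a) / len(group_a)
        mean_b = sum(norm[j] for j in group_b) / len(group_b)
        result[gene] = math.log2((mean_b + 1) / (mean_a + 1))
    return result


if __name__ == "__main__":
    # Hypothetical counts: columns are [male_1, male_2, female_1, female_2].
    toy_counts = {
        "geneA": [120, 100, 30, 25],
        "geneB": [50, 45, 55, 48],
        "geneC": [10, 12, 200, 180],
        "geneD": [300, 280, 310, 290],
    }
    for gene, lfc in log2_fold_changes(toy_counts, [0, 1], [2, 3]).items():
        print(f"{gene}\t{lfc:+.2f}")
```

Normalizing by size factors before comparing groups keeps differences in sequencing depth between libraries from being mistaken for expression changes.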

To determine enriched Gene Ontology (GO) terms among
differentially expressed genes, we supplied the gene list to topGO
(Alexa et al., 2006), which we integrated into Oqtans. Its graphical
output highlights expression differences in genes annotated
with the functions 'reproduction' and 'sex determination', as is
expected for this comparison between male and female fruit fly
transcriptomes (see Supplementary Fig. S3).
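
The idea behind term enrichment can be sketched with a classic one-sided hypergeometric (Fisher-type) test: given how many genes in the universe carry a GO term, how surprising is the number of annotated genes among the differentially expressed ones? The Python sketch below shows this basic test only. topGO additionally offers algorithms that take the GO graph structure into account, which this example does not attempt; the function name and all numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe  -- number of genes considered in total
    n_annotated -- genes in the universe annotated with the GO term
    n_selected  -- differentially expressed genes
    n_overlap   -- differentially expressed genes annotated with the term
    """
    max_overlap = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probabilities of observing n_overlap or more annotated genes.
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 12000 genes, 150 annotated with a term,
    # 400 differentially expressed, 20 of which carry the term.
    p = enrichment_p_value(12000, 150, 400, 20)
    print(f"enrichment p-value: {p:.3g}")
```

In practice, one such test is run per GO term, so the resulting p-values need to be interpreted with multiple testing in mind.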

The whole experiment, excluding short-read alignment, requires
~10 min of compute time. The duration of the alignment depends on
the number and size of compute nodes that can be allocated for
this task (20 min in our setup with 19 '4x large memory'
instances on Amazon Web Services).

Uniquely, Oqtans, building on the Galaxy framework, allows
integrated software tools to be compared directly on the same
input data. This is of great value for researchers who are looking
for the most appropriate and accurate algorithm to analyze their
newly generated data. For instance, for de novo transcript
prediction, the accuracy of the alignments is particularly
important. We demonstrate this by comparing the accuracy of
introns predicted from spliced alignments generated by TopHat
and PALMapper, measured against the genome annotation
(Fig. 2A and see Supplementary Section S3 for details). Although
alignment accuracy may have a negligible effect on the detection
of differentially expressed annotated genes, it becomes crucial
for de novo inference of transcripts (isoforms). Owing to the high
resolution provided by RNA-seq, the discovery of novel transcript
isoforms from these data has been a prime analysis goal. Görnitz
et al. (2011) compared the accuracy of transcript inference by
combining different read alignment programs (PALMapper,
TopHat) with different transcript predictors [margin-based
Transcript Identification Method (mTIM), Cufflinks]. All tools
used in this example are integrated into Oqtans and can be easily
combined in workflows to reproduce this and similar comparisons
(Fig. 2B; see Supplementary Section S3 for more details).
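
One simple way to quantify such an intron-level comparison is to treat predicted and annotated introns as sets of coordinates and compute precision and recall. The Python sketch below assumes this definition for illustration; the evaluation actually used is described in Supplementary Section S3 and may differ. The function name, the intron tuples and the coordinates are hypothetical.

```python
def intron_accuracy(predicted, annotated):
    """Precision, recall and F1 of predicted introns against an annotation.

    Each intron is a hashable tuple such as (chromosome, start, end, strand).
    An intron counts as correct only if all fields match exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical annotation and two hypothetical aligner outputs.
    annotation = {
        ("2L", 1000, 1500, "+"),
        ("2L", 2000, 2600, "+"),
        ("2L", 3000, 3400, "-"),
        ("X", 500, 900, "+"),
    }
    aligner_outputs = {
        "aligner_1": [("2L", 1000, 1500, "+"), ("2L", 2000, 2600, "+"),
                      ("2L", 3005, 3400, "-")],
        "aligner_2": [("2L", 1000, 1500, "+"), ("X", 500, 900, "+")],
    }
    for name, introns in aligner_outputs.items():
        p, r, f = intron_accuracy(introns, annotation)
        print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Because both aligners are scored against the same annotation and the same reads, differences in these numbers reflect the tools rather than the setup.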

3 DISCUSSION 

As high-throughput genome and transcriptome sequencing
becomes routine in many laboratories around the world, there is an
increasing demand for standardized data analysis. Directly
associated with this need are accessibility, transparency and
persistency of analysis pipelines. As a Galaxy web server,
Oqtans brings us closer to these goals (Schultheiss, 2011) for
the important task of RNA-seq data analysis by providing
easy access to state-of-the-art analysis tools to a wide audience.
Importantly, while profiting from many free software development
efforts, its user-friendly interface abstracts away programming
languages and operating systems, and thus enables even
inexperienced users to rapidly analyze their RNA-seq data.

Fig. 2. (A) Performance comparison of two alignment programs
integrated in Oqtans, evaluated on the data from the use case in terms of
intron accuracy (see Supplementary Fig. S3 for details). Such comparative
evaluations are made easy, since the reproducibility of the Galaxy-based
Oqtans setup ensures otherwise identical conditions. (B) Performance
comparison from Görnitz et al. (2011), where PALMapper and TopHat
alignments are processed with the de novo transcript inference tools
mTIM and Cufflinks, again demonstrating the value of Oqtans for
comparisons of analysis tools.